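A 10-Minute Introduction to pandas
First, import the modules as follows (NumPy is needed as well, because the examples use np.nan and np.random):
import numpy as np
import pandas as pd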
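pandas data structures: Series
A Series can be thought of as a one-dimensional array. The main difference between a Series and a plain one-dimensional array is that a Series carries an index, which relates it to another data structure common in programming, the hash table.
When you create a Series without passing an index, pandas defaults to an integer index starting at 0.
s = pd.Series([1,3,5,np.nan,6,8])
s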
0 1
1 3
2 5
3 NaN
4 6
5 8
dtype: float64
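To make the hash analogy concrete, here is a minimal sketch of label-based lookup on a Series. The name `prices` and the labels 'a', 'b', 'c' are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical data: labels and values are illustrative only.
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"])

# Look up a value by its label, much like a key in a hash table.
b_value = prices["b"]

# A Series can also be built from a dict; the keys become the index.
from_dict = pd.Series({"a": 1.5, "b": 2.0, "c": 3.25})

# Unlike a plain dict, arithmetic aligns the two operands on their labels.
total = prices + from_dict
print(b_value)
print(total)
```

Label alignment is the part a plain array does not give you: the two operands are matched by index label, not by position.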
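pandas data structures: DataFrame
A DataFrame is a two-dimensional data structure built by combining several Series as columns; each column, taken on its own, is a Series, much like data pulled from a SQL database. Working on a DataFrame column by column is therefore the more convenient approach, and it pays to get used to thinking about data in terms of columns. The strength of a DataFrame is that it handles columns of different types with ease. For the same reason, do not reach for it when you want to, say, invert a table that consists entirely of floats; for that kind of problem it is more convenient to keep the data in a NumPy array or matrix.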
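As a rough sketch of that division of labour, the example below pulls one column out as a Series and hands the numeric values to NumPy for the inversion. The frame `frame` and its columns `x` and `y` are hypothetical, chosen so the inverse is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 all-float frame, small enough to invert by hand.
frame = pd.DataFrame({"x": [2.0, 0.0], "y": [0.0, 4.0]})

# Each column taken on its own is a Series.
col = frame["x"]

# For linear algebra, convert to a NumPy array and let NumPy do the work.
arr = frame.to_numpy()
inverse = np.linalg.inv(arr)
print(type(col).__name__)
print(inverse)
```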
Create a DataFrame by passing a NumPy array:
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 |
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 |
2013-01-03 | -0.245746 | -0.226585 | 1.749624 | 1.140817 |
2013-01-04 | 0.032400 | -0.264382 | 0.125095 | -1.322739 |
2013-01-05 | -2.260707 | 0.064878 | 0.231025 | 0.682991 |
2013-01-06 | 0.603739 | 1.490709 | 0.249649 | 1.822501 |
Create a DataFrame by passing a dict:
df2 = pd.DataFrame({ 'A' : 1.,
df2
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1 | 2013-01-02 | 1 | 3 | test | foo |
1 | 1 | 2013-01-02 | 1 | 3 | train | foo |
2 | 1 | 2013-01-02 | 1 | 3 | test | foo |
3 | 1 | 2013-01-02 | 1 | 3 | train | foo |
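The dict-based constructor call above is cut off after its first key. The sketch below is one way to get a table like the one shown. Every entry other than 'A' is an assumption inferred from the printed output, not the author's original code.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction: only the 'A' entry appears in the original listing.
df2 = pd.DataFrame({
    "A": 1.0,                                   # scalar, broadcast to every row
    "B": pd.Timestamp("20130102"),              # single timestamp, broadcast
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),      # array sets the row count to 4
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                 # scalar string, broadcast
})
print(df2)
print(df2.dtypes)                               # each column keeps its own dtype
```

Scalars are broadcast to the length set by the array-like entries, which is why a single value for 'A' fills all four rows.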
df2.F
0 foo
1 foo
2 foo
3 foo
Name: F, dtype: object
df2.A
0 1
1 1
2 1
3 1
Name: A, dtype: float64
View the first or last few rows of the data:
df.head(2)
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 |
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 |
df.tail(3)
| | A | B | C | D |
---|---|---|---|---|
2013-01-04 | 0.032400 | -0.264382 | 0.125095 | -1.322739 |
2013-01-05 | -2.260707 | 0.064878 | 0.231025 | 0.682991 |
2013-01-06 | 0.603739 | 1.490709 | 0.249649 | 1.822501 |
Display the row index, the column index, and the underlying values:
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
df.values
array([[ 0.21287973, 0.35172526, -1.35057903, -0.10740265],
[-0.85790301, -1.78332415, 1.16288782, -0.48822551],
[-0.24574644, -0.22658458, 1.74962416, 1.14081656],
[ 0.03240016, -0.26438175, 0.12509531, -1.32273918],
[-2.26070679, 0.06487812, 0.23102475, 0.68299111],
[ 0.60373902, 1.4907093 , 0.24964875, 1.82250141]])
Show a quick statistical summary of the data:
df.describe()
| | A | B | C | D |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | -0.419223 | -0.061163 | 0.361284 | 0.287990 |
std | 1.026018 | 1.061053 | 1.056953 | 1.148160 |
min | -2.260707 | -1.783324 | -1.350579 | -1.322739 |
25% | -0.704864 | -0.254932 | 0.151578 | -0.393020 |
50% | -0.106673 | -0.080853 | 0.240337 | 0.287794 |
75% | 0.167760 | 0.280013 | 0.934578 | 1.026360 |
max | 0.603739 | 1.490709 | 1.749624 | 1.822501 |
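As a sanity check on how to read this summary, the sketch below recomputes a few of its rows by hand on a tiny made-up column. It relies on the pandas defaults as I understand them: `std` is the sample standard deviation (ddof=1) and the percentiles use linear interpolation. The frame `small` is hypothetical.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})  # hypothetical data
summary = small.describe()

mean_a = small["A"].mean()             # matches the 'mean' row
std_a = small["A"].std()               # sample std (ddof=1), matches 'std'
median_a = small["A"].quantile(0.5)    # matches the '50%' row
print(summary)
```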
Transpose the data:
df.T
| | 2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 |
---|---|---|---|---|---|---|
A | 0.212880 | -0.857903 | -0.245746 | 0.032400 | -2.260707 | 0.603739 |
B | 0.351725 | -1.783324 | -0.226585 | -0.264382 | 0.064878 | 1.490709 |
C | -1.350579 | 1.162888 | 1.749624 | 0.125095 | 0.231025 | 0.249649 |
D | -0.107403 | -0.488226 | 1.140817 | -1.322739 | 0.682991 | 1.822501 |
Sort by the labels along an axis (here the column labels, in descending order):
df.sort_index(axis=1,ascending=False)
| | D | C | B | A |
---|---|---|---|---|
2013-01-01 | -0.107403 | -1.350579 | 0.351725 | 0.212880 |
2013-01-02 | -0.488226 | 1.162888 | -1.783324 | -0.857903 |
2013-01-03 | 1.140817 | 1.749624 | -0.226585 | -0.245746 |
2013-01-04 | -1.322739 | 0.125095 | -0.264382 | 0.032400 |
2013-01-05 | 0.682991 | 0.231025 | 0.064878 | -2.260707 |
2013-01-06 | 1.822501 | 0.249649 | 1.490709 | 0.603739 |
Sort by the values in a column:
df.sort_values(by='B')
| | A | B | C | D |
---|---|---|---|---|
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 |
2013-01-04 | 0.032400 | -0.264382 | 0.125095 | -1.322739 |
2013-01-03 | -0.245746 | -0.226585 | 1.749624 | 1.140817 |
2013-01-05 | -2.260707 | 0.064878 | 0.231025 | 0.682991 |
2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 |
2013-01-06 | 0.603739 | 1.490709 | 0.249649 | 1.822501 |
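The two sorting calls are easy to mix up, so here is a small side-by-side sketch on a made-up frame. The data and the name `frame` are illustrative only. Both methods return a new object and leave the original untouched.

```python
import pandas as pd

# Hypothetical frame with deliberately unsorted labels and values.
frame = pd.DataFrame({"B": [3, 1, 2], "A": [9, 8, 7]}, index=["z", "x", "y"])

by_columns = frame.sort_index(axis=1)                   # column labels: A, B
by_rows = frame.sort_index()                            # row labels: x, y, z
by_value = frame.sort_values(by="B", ascending=False)   # rows ordered by B, largest first
print(by_value)
```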
Select a single column (equivalent to df.A):
df['A']
2013-01-01 0.212880
2013-01-02 -0.857903
2013-01-03 -0.245746
2013-01-04 0.032400
2013-01-05 -2.260707
2013-01-06 0.603739
Freq: D, Name: A, dtype: float64
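Bracket and attribute access return the same column only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below illustrates that caveat with hypothetical column names (`count`, `my col`).

```python
import pandas as pd

# Hypothetical columns: 'count' collides with a DataFrame method name.
frame = pd.DataFrame({"A": [1, 2], "count": [10, 20], "my col": [5, 6]})

same = frame["A"].equals(frame.A)   # both forms return the same column
by_key = frame["count"]             # the column
by_attr = frame.count               # the DataFrame.count method, not the column
spaced = frame["my col"]            # only bracket access works for this name
print(same)
```

For that reason bracket access is the safer habit in scripts, with attribute access kept for quick interactive work.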
Slice out a few rows with []:
df[0:3]
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 |
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 |
2013-01-03 | -0.245746 | -0.226585 | 1.749624 | 1.140817 |
df['20130102':'20130104']
| | A | B | C | D |
---|---|---|---|---|
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 |
2013-01-03 | -0.245746 | -0.226585 | 1.749624 | 1.140817 |
2013-01-04 | 0.032400 | -0.264382 | 0.125095 | -1.322739 |
Select by label:
df.loc[dates[0],['A','B']]
A 0.212880
B 0.351725
Name: 2013-01-01 00:00:00, dtype: float64
Select by position:
df.iloc[1:3,0:2]
| | A | B |
---|---|---|
2013-01-02 | -0.857903 | -1.783324 |
2013-01-03 | -0.245746 | -0.226585 |
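Label slices and position slices treat their endpoints differently, which is worth seeing once. The sketch below uses a frame filled with consecutive integers (a made-up stand-in for the random data above) so the results are easy to check.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Hypothetical frame: values 0..23 so each cell is easy to identify.
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = frame.loc["20130102":"20130104", ["A", "B"]]  # end label is included
by_position = frame.iloc[1:3, 0:2]                       # end position is excluded
single = frame.at[dates[0], "A"]                         # scalar access by label
print(by_label.shape, by_position.shape, single)
```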
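The reindex method returns a copy with new row and column labels and can add ones that did not exist before. Here it keeps the first four dates and adds a column E, which is then filled for the first two rows:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
| | A | B | C | D | E |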
---|---|---|---|---|---|
2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 | 1 |
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 | 1 |
2013-01-03 | -0.245746 | -0.226585 | 1.749624 | 1.140817 | NaN |
2013-01-04 | 0.032400 | -0.264382 | 0.125095 | -1.322739 | NaN |
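A minimal sketch of what reindex does with labels that are missing or new, using a made-up three-row frame. Labels that are not requested are dropped, and newly requested ones are filled with NaN unless `fill_value` is given.

```python
import pandas as pd

# Hypothetical frame with string labels for readability.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Keep rows 'a' and 'b', request a new row 'd' and a new column 'E'.
wider = frame.reindex(index=["a", "b", "d"], columns=["A", "E"])

# fill_value replaces the default NaN in newly created cells.
filled = frame.reindex(index=["a", "b", "d"], fill_value=0.0)
print(wider)
print(filled)
```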
Handling missing data:
Drop every row that contains missing data:
df1.dropna(how='any')
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 | 1 |
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 | 1 |
Fill in missing data:
df1.fillna(value=5)
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 | 1 |
2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 | 1 |
2013-01-03 | -0.245746 | -0.226585 | 1.749624 | 1.140817 | 5 |
2013-01-04 | 0.032400 | -0.264382 | 0.125095 | -1.322739 | 5 |
Get a boolean mask that marks where values are missing:
pd.isnull(df1)
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-01 | False | False | False | False | False |
2013-01-02 | False | False | False | False | False |
2013-01-03 | False | False | False | False | True |
2013-01-04 | False | False | False | False | True |
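The boolean mask becomes more useful once it is reduced to counts. The sketch below does that on a small made-up frame and also shows that `dropna` and `fillna` return new objects and do not modify the original.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing cells in column 'E'.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": [1.0, np.nan, np.nan]})

missing_per_column = frame.isna().sum()   # number of NaN cells in each column
has_missing = frame.isna().any().any()    # True if any cell at all is NaN
dropped = frame.dropna(how="any")         # keeps only rows without NaN
filled = frame.fillna(value=5)            # the original frame is left unchanged
print(missing_per_column)
```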
Reading and writing files
Write a CSV file:
df.to_csv('foo.csv')
Read a CSV file:
pd.read_csv('foo.csv')
| | Unnamed: 0 | A | B | C | D |
---|---|---|---|---|---|
0 | 2013-01-01 | 0.212880 | 0.351725 | -1.350579 | -0.107403 |
1 | 2013-01-02 | -0.857903 | -1.783324 | 1.162888 | -0.488226 |
2 | 2013-01-03 | -0.245746 | -0.226585 | 1.749624 | 1.140817 |
3 | 2013-01-04 | 0.032400 | -0.264382 | 0.125095 | -1.322739 |
4 | 2013-01-05 | -2.260707 | 0.064878 | 0.231025 | 0.682991 |
5 | 2013-01-06 | 0.603739 | 1.490709 | 0.249649 | 1.822501 |
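The `Unnamed: 0` column appears because the date index was written out as an ordinary first column. One way to get the original index back, sketched here with an in-memory buffer standing in for `foo.csv` and a small made-up frame, is to pass `index_col=0` and `parse_dates=True` to `read_csv`.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
# Hypothetical frame; the values are chosen only to be easy to compare.
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=["A", "B"])

buffer = io.StringIO()   # in-memory stand-in for 'foo.csv'
frame.to_csv(buffer)
buffer.seek(0)

# Use the first column as the index and parse it as dates.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored)
```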